Final Project
Section 1: Objective
For this project, I will be showing the log2(fold-change) in differential gene expression for RNA-seq data in two subsets of colon cancer: adenocarcinoma and cystic, mucinous and serous neoplasms. The final deliverable will be the the Glimma Vignette. The input files for the Glimma vignette will be HTSeq-count data obtained from The Cancer Genome Atlas Program (TCGA). I will be turning in an HTML containing my R Notebook.
Section 2: Datasets
Loading data
Data is obtained from TCGA. I filtered for RNA-Seq experimental strategy, TXT data format and HTSeq-counts workflow type. HTSeq-counts is a tool that quantifies the aligned reads overlapping a gene’s exons. HTSeq data does not have a header, is tab-delimited, the first column is the Ensembl gene ID and the second column is the number of mapped reads of the gene. The counts will be used in differential gene expression analysis using edgeR as the method. To look at the differential gene expression, the counts will be normalized using the calcNormFactors in edgeR and only reads that unambigously map to one gene are used.
RAW: URL for cystic, mucinous, and serous neoplasms, I will choose 30:
RAW: URL for adenocarcinoma, I will choose 30:
Unit test for data
The first 15 genes listed in the first column are:
ENSG00000000003.13
ENSG00000000005.5
ENSG00000000419.11
ENSG00000000457.12
ENSG00000000460.15
ENSG00000000938.11
ENSG00000000971.14
ENSG00000001036.12
ENSG00000001084.9
ENSG00000001167.13
ENSG00000001460.16
ENSG00000001461.15
ENSG00000001497.15
ENSG00000001561.6
ENSG00000001617.10
The last 15 lines listed in the first column are:
ENSGR0000263980.4
ENSGR0000264510.4
ENSGR0000264819.4
ENSGR0000265658.4
ENSGR0000270726.4
ENSGR0000275287.3
ENSGR0000276543.3
ENSGR0000277120.3
ENSGR0000280767.1
ENSGR0000281849.1
__no_feature
__ambiguous
__too_low_aQual
__not_aligned
__alignment_not_unique
Unit test for normalized data
Make a box plots of unnormalized and normalized data
par(mfrow=c(1,2))
lcpm <- cpm(x2, log=TRUE)
boxplot(lcpm, las=2, col=col, main="")
title(main="A. Example: Unnormalised data",ylab="Log-cpm")
x2 <- calcNormFactors(x2)
x2$samples$norm.factors
Proposed Analysis
This project will compute and analyze the logarithmic ratio of differential gene expression of two subtypes of colon cancer. EdgeR will be used to import, organize, and normalize the data, Mus.musculus will be used for gene annotions, limma will be used to examine the gene expression anaylsis and make exploratory plots, Glimma will be used to make these plots interactive. RColorBrewer and gplots will be used to make heatmaps.
Flowchart
library(DiagrammeR)
grViz("digraph flowchart {
# node definitions with substituted label text
node [fontname = Helvetica, shape = rectangle]
tab1 [label = '@@1']
tab2 [label = '@@2']
tab3 [label = '@@3']
tab4 [label = '@@4']
tab5 [label = '@@5']
tab6 [label = '@@6']
tab7 [label = '@@7']
tab8 [label = '@@8']
tab9 [label = '@@9']
tab10 [label = '@@10']
# edge definitions with the node IDs
tab1 -> tab2;
tab2 -> tab3;
tab3 -> tab4;
tab4 -> tab8 -> tab5;
tab4 -> tab5 -> tab6 -> tab7 -> tab9 -> tab10
}
[1]: 'Download necessary libraries'
[2]: 'Load and read datasets'
[3]: 'Join datasets'
[4]: 'Unit test: Data was properly loaded?'
[5]: 'Normalize data'
[6]: 'Find Mean Varience Trend'
[7]: 'Analyze DE genes'
[8]: 'Troubleshoot and fix errors'
[9]: 'Make Interactive MDS plot'
[10]: 'Make HeatMap of log-CPM data'
")
NA
Loading libraries
library(limma)
library(Glimma)
library(edgeR)
library(Mus.musculus)
library(gplots)
library(RColorBrewer)
Proposed Timeline and Milestones
Week 1: Run the Glimma vignette. I will install the necessary packages in R and understand each step in the vignette.
Week 2: Load in the data (joins, creating datasets) and do a simple, 1 line unit test to look at the data. I will download 60 datasets (30 from each subtype) and join multiple datasets.
Week 3: Confirm that the data was loaded in correctly and analyze data using the Glimma vignette.
Week 4: Troubleshoot for additional errors and enhance the user interface.
User Interface
I anticipate having boxplots, heatmaps, and interactive multi-dimensional scaling (MDS) plots done in an R Notebook. I will submit an HTML page of my completed R Notebook.
MDS plot
library(limma)
lcpm <- cpm(x, log=TRUE)
par(mfrow=c(1,2))
col.group <- group
levels(col.group) <- brewer.pal(nlevels(col.group), "Set1")
col.group <- as.character(col.group)
col.lane <- lane
levels(col.lane) <- brewer.pal(nlevels(col.lane), "Set2")
col.lane <- as.character(col.lane)
plotMDS(lcpm, labels=group, col=col.group)
title(main="A. Sample groups")
plotMDS(lcpm, labels=lane, col=col.lane, dim=c(3,4))
title(main="B. Sequencing lanes")
Make MDS plot interactive
library(glimma)
glMDSPlot(lcpm, labels=paste(group, lane, sep="_"),
groups=x$samples[,c(2,5)], launch=FALSE)
HeatMap of log-CPM data
library(gplots)
basal.vs.lp.topgenes <- basal.vs.lp$ENTREZID[1:100]
i <- which(v$genes$ENTREZID %in% basal.vs.lp.topgenes)
mycol <- colorpanel(1000,"blue","white","red")
heatmap.2(lcpm[i,], scale="row",
labRow=v$genes$SYMBOL[i], labCol=group,
col=mycol, trace="none", density.info="none",
margin=c(8,6), lhei=c(2,10), dendrogram="column")
A script for setting up the Glimma Vignette
{ if (!requireNamespace(“BiocManager”, quietly = TRUE)) install.packages(“BiocManager”) BiocManager::install(“limma”) library(limma)
if (!requireNamespace(“BiocManager”, quietly = TRUE)) install.packages(“BiocManager”) BiocManager::install(“Glimma”) library(Glimma)
if (!requireNamespace(“BiocManager”, quietly = TRUE)) install.packages(“BiocManager”)
BiocManager::install(“edgeR”) library(edgeR)
if (!requireNamespace(“BiocManager”, quietly = TRUE)) install.packages(“BiocManager”)
BiocManager::install(“Mus.musculus”) library(Mus.musculus)
library(R.utils)
if (!requireNamespace(“BiocManager”, quietly = TRUE)) install.packages(“BiocManager”)
BiocManager::install(“CAMERA”) library(CAMERA) }
Data Packaging
Reading in count data
library(R.utils)
url <- "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE63310&format=file"
utils::download.file(url, destfile="GSE63310_RAW.tar", mode="wb")
trying URL 'https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE63310&format=file'
Content type 'application/x-tar' length 1996800 bytes (1.9 MB)
==================================================
downloaded 1.9 MB
utils::untar("GSE63310_RAW.tar", exdir = ".")
files <- c("GSM1545535_10_6_5_11.txt", "GSM1545536_9_6_5_11.txt", "GSM1545538_purep53.txt",
"GSM1545539_JMS8-2.txt", "GSM1545540_JMS8-3.txt", "GSM1545541_JMS8-4.txt",
"GSM1545542_JMS8-5.txt", "GSM1545544_JMS9-P7c.txt", "GSM1545545_JMS9-P8c.txt")
for(i in paste(files, ".gz", sep=""))
R.utils::gunzip(i, overwrite=TRUE)
files <- c("GSM1545535_10_6_5_11.txt", "GSM1545536_9_6_5_11.txt",
"GSM1545538_purep53.txt", "GSM1545539_JMS8-2.txt",
"GSM1545540_JMS8-3.txt", "GSM1545541_JMS8-4.txt",
"GSM1545542_JMS8-5.txt", "GSM1545544_JMS9-P7c.txt",
"GSM1545545_JMS9-P8c.txt")
read.delim(files[1], nrow=5)
Reading the 9 text files into R and combining into a matrix of counts
library(edgeR)
x <- readDGE(files, columns=c(1,3))
class(x)
[1] "DGEList"
attr(,"package")
[1] "edgeR"
dim(x)
[1] 27179 9
Organizing gene annotations
library(Mus.musculus)
geneid <- rownames(x)
genes <- select(Mus.musculus, keys=geneid, columns=c("SYMBOL", "TXCHROM"),
keytype="ENTREZID")
'select()' returned 1:many mapping between keys and columns
head(genes)
Resolve duplicate gene IDs
library(Mus.musculus)
genes <- genes[!duplicated(genes$ENTREZID),]
Data Pre-processing
Remove lowly expressed genes
library(edgeR)
table(rowSums(x$counts==0)==9)
FALSE
16624
Filter genes while keeping as many genes as possible with worthwile counts
library(edgeR)
keep.exprs <- filterByExpr(x, group=group)
x <- x[keep.exprs,, keep.lib.sizes=FALSE]
dim(x)
[1] 16624 9
Plot the density of log-CPM values for raw and filtered data
lcpm.cutoff <- log2(10/M + 2/L)
library(RColorBrewer)
nsamples <- ncol(x)
col <- brewer.pal(nsamples, "Paired")
par(mfrow=c(1,2))
plot(density(lcpm[,1]), col=col[1], lwd=2, ylim=c(0,0.26), las=2, main="", xlab="")
title(main="A. Raw data", xlab="Log-cpm")
abline(v=lcpm.cutoff, lty=3)
for (i in 2:nsamples){
den <- density(lcpm[,i])
lines(den$x, den$y, col=col[i], lwd=2)
}
legend("topright", samplenames, text.col=col, bty="n")
lcpm <- cpm(x, log=TRUE)
plot(density(lcpm[,1]), col=col[1], lwd=2, ylim=c(0,0.26), las=2, main="", xlab="")
title(main="B. Filtered data", xlab="Log-cpm")
abline(v=lcpm.cutoff, lty=3)
for (i in 2:nsamples){
den <- density(lcpm[,i])
lines(den$x, den$y, col=col[i], lwd=2)
}
legend("topright", samplenames, text.col=col, bty="n")

Normalising gene expression distributions
library(edgeR)
x <- calcNormFactors(x, method = "TMM")
x$samples$norm.factors
[1] 0.8943956 1.0250186 1.0459005 1.0458455 1.0162707 0.9217132 0.9961959 1.0861026 0.9839203
Improve visualization by duplicating data then adjusting the counts
x2 <- x
x2$samples$norm.factors <- 1
x2$counts[,1] <- ceiling(x2$counts[,1]*0.05)
x2$counts[,2] <- x2$counts[,2]*5
Boxplot expression distribution of samples for normalised and unnormalised data
par(mfrow=c(1,2))
lcpm <- cpm(x2, log=TRUE)
boxplot(lcpm, las=2, col=col, main="")
title(main="A. Example: Unnormalised data",ylab="Log-cpm")
x2 <- calcNormFactors(x2)
x2$samples$norm.factors
[1] 0.05770899 6.08287835 1.22023972 1.16478991 1.19661094 1.04659233 1.15048074 1.25431164 1.10901983
lcpm <- cpm(x2, log=TRUE)
boxplot(lcpm, las=2, col=col, main="")
title(main="B. Example: Normalised data",ylab="Log-cpm")

Unsupervised clustering of cells: make multi-dimensional scaling plot (MDS) to show simmilarities and dissimilarities between samples in an unsupervised manner
library(limma)
lcpm <- cpm(x, log=TRUE)
par(mfrow=c(1,2))
col.group <- group
levels(col.group) <- brewer.pal(nlevels(col.group), "Set1")
col.group <- as.character(col.group)
col.lane <- lane
levels(col.lane) <- brewer.pal(nlevels(col.lane), "Set2")
col.lane <- as.character(col.lane)
plotMDS(lcpm, labels=group, col=col.group)
title(main="A. Sample groups")
plotMDS(lcpm, labels=lane, col=col.lane, dim=c(3,4))
title(main="B. Sequencing lanes")
Make interactive using Glimma, html page will be generarted and opened in a browser if launch=TRUE
library(Glimma)
glMDSPlot(lcpm, labels=paste(group, lane, sep="_"),
groups=x$samples[,c(2,5)], launch=FALSE)
Differential Expression Analysis
Creating a design matrix
design <- model.matrix(~0+group+lane)
colnames(design) <- gsub("group", "", colnames(design))
design
Basal LP ML laneL006 laneL008
1 0 1 0 0 0
2 0 0 1 0 0
3 1 0 0 0 0
4 1 0 0 1 0
5 0 0 1 1 0
6 0 1 0 1 0
7 1 0 0 1 0
8 0 0 1 0 1
9 0 1 0 0 1
attr(,"assign")
[1] 1 1 1 2 2
attr(,"contrasts")
attr(,"contrasts")$group
[1] "contr.treatment"
attr(,"contrasts")$lane
[1] "contr.treatment"
Contrasts for pairwise comparisons between cell populations
library(Glimma)
contr.matrix <- makeContrasts(
BasalvsLP = Basal-LP,
BasalvsML = Basal - ML,
LPvsML = LP - ML,
levels = colnames(design))
contr.matrix
Contrasts
Levels BasalvsLP BasalvsML LPvsML
Basal 1 1 0
LP -1 0 1
ML 0 -1 -1
laneL006 0 0 0
laneL008 0 0 0
Remove heteroscedascity from count data
library(Glimma)
par(mfrow=c(1,2))
v <- voom(x, design, plot=TRUE)

v
An object of class "EList"
$genes
16619 more rows ...
$targets
$E
Samples
Tags 10_6_5_11 9_6_5_11 purep53 JMS8-2 JMS8-3 JMS8-4 JMS8-5 JMS9-P7c JMS9-P8c
497097 -4.292165 -3.856488 2.5185849 3.2931366 -4.459730 -3.994060 3.2869677 -3.2103696 -5.295316
20671 -4.292165 -4.593453 0.3560126 -0.4073032 -1.200995 -1.943434 0.8442767 -0.3228444 -1.207853
27395 3.876089 4.413107 4.5170045 4.5617546 4.344401 3.786363 3.8990635 4.3396075 4.124644
18777 4.708774 5.571872 5.3964008 5.1623650 5.649355 5.081611 5.0602470 5.7513694 5.142436
21399 4.785541 4.754537 5.3703795 5.1220551 4.869586 4.943840 5.1384776 5.0308985 4.979644
16619 more rows ...
$weights
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] 1.079413 1.332986 19.826915 20.273253 1.993686 1.395853 20.494977 1.107780 1.079413
[2,] 1.170357 1.456380 4.804866 8.660025 3.612508 2.626870 8.760149 3.211473 2.541942
[3,] 20.219073 25.573792 30.434759 28.528310 31.352260 25.743247 28.722497 21.200072 16.657930
[4,] 26.947557 32.505933 33.583128 33.232125 34.231754 32.354158 33.334340 30.348630 24.259801
[5,] 26.610864 28.501638 33.645479 33.206374 33.573492 31.996623 33.308490 25.171513 23.573305
16619 more rows ...
$design
Basal LP ML laneL006 laneL008
1 0 1 0 0 0
2 0 0 1 0 0
3 1 0 0 0 0
4 1 0 0 1 0
5 0 0 1 1 0
6 0 1 0 1 0
7 1 0 0 1 0
8 0 0 1 0 1
9 0 1 0 0 1
attr(,"assign")
[1] 1 1 1 2 2
attr(,"contrasts")
attr(,"contrasts")$group
[1] "contr.treatment"
attr(,"contrasts")$lane
[1] "contr.treatment"
Apply voom precision weights to data
vfit <- lmFit(v, design)
vfit <- contrasts.fit(vfit, contrasts=contr.matrix)
efit <- eBayes(vfit)
plotSA(efit, main="Final model: Mean-variance trend")

Examine the number of DE genes
summary(decideTests(efit))
BasalvsLP BasalvsML LPvsML
Down 4648 4927 3135
NotSig 7113 7026 10972
Up 4863 4671 2517
Set a minimum log-fold change(log-FC) of 1
tfit <- treat(vfit, lfc=1)
dt <- decideTests(tfit)
summary(dt)
BasalvsLP BasalvsML LPvsML
Down 1632 1777 224
NotSig 12976 12790 16210
Up 2016 2057 190
Extract and write results for all 3 comparisons (basalvsLP, basalvsML, and LPvsML) to a single output file
write.fit(tfit, dt, file="results.txt")
Examining individual DE genes from top to bottom
basal.vs.lp <- topTreat(tfit, coef=1, n=Inf)
basal.vs.ml <- topTreat(tfit, coef=2, n=Inf)
head(basal.vs.lp)
head(basal.vs.ml)
Summarize results for genes using mean-difference plots that highlight differentially expressed genes
plotMD(tfit, column=1, status=dt[,1], main=colnames(tfit)[1],
xlim=c(-8,13))

Make interactive mean-difference plot. To open html page in a browser make launch=TRUE.
glMDPlot(tfit, coef=1, status=dt, main=colnames(tfit)[1], side.main="ENTREZID", counts=lcpm, groups=group, launch=TRUE)
Make heatmap
library(gplots)
Attaching package: ‘gplots’
The following object is masked from ‘package:S4Vectors’:
space
The following object is masked from ‘package:stats’:
lowess
basal.vs.lp.topgenes <- basal.vs.lp$ENTREZID[1:100]
i <- which(v$genes$ENTREZID %in% basal.vs.lp.topgenes)
mycol <- colorpanel(1000,"blue","white","red")
heatmap.2(lcpm[i,], scale="row",
labRow=v$genes$SYMBOL[i], labCol=group,
col=mycol, trace="none", density.info="none",
margin=c(8,6), lhei=c(2,10), dendrogram="column")
Error in plot.new() : figure margins too large

Gene set testing by applying the camera method on c2 gene signatures from the Broad Institute’s MSigDB c2 collection
load("/Users/ashleynoriega/Downloads/mouse_c2_v5p2.rdata")
idx <- ids2indices(Mm.c2,id=rownames(v))
cam.BasalvsLP <- camera(v,idx,design,contrast=contr.matrix[,1])
head(cam.BasalvsLP,5)
cam.BasalvsML <- camera(v,idx,design,contrast=contr.matrix[,2])
head(cam.BasalvsML,5)
cam.LPvsML <- camera(v,idx,design,contrast=contr.matrix[,3])
head(cam.LPvsML,5)
barcodeplot(efit$t[,3], index=idx$LIM_MAMMARY_LUMINAL_MATURE_UP,
index2=idx$LIM_MAMMARY_LUMINAL_MATURE_DN, main="LPvsML")

---
title: "TRGN 510 Final Project: Colon Cancer"
output: html_notebook
---

# Final Project

## Section 1: Objective
For this project, I will be showing the log2(fold-change) in differential gene expression for RNA-seq data in two subsets of colon cancer: adenocarcinoma and cystic, mucinous and serous neoplasms. The final deliverable will be the the Glimma Vignette. The input files for the Glimma vignette will be HTSeq-count data obtained from The Cancer Genome Atlas Program (TCGA). I will be turning in an HTML containing my R Notebook.

## Section 2: Datasets

### Loading data
Data is obtained from TCGA. I filtered for RNA-Seq experimental strategy, TXT data format and HTSeq-counts workflow type. HTSeq-counts is a tool that quantifies the aligned reads overlapping a gene's exons. HTSeq data does not have a header, is tab-delimited, the first column is the Ensembl gene ID and the second column is the number of mapped reads of the gene. The counts will be used in differential gene expression analysis using edgeR as the method. To look at the differential gene expression, the counts will be normalized using the calcNormFactors in edgeR and only reads that unambigously map to one gene are used.

RAW: [URL for cystic, mucinous, and serous neoplasms, I will choose 30:](https://portal.gdc.cancer.gov/repository?filters=%7B"op"%3A"and"%2C"content"%3A%5B%7B"content"%3A%7B"field"%3A"cases.case_id"%2C"value"%3A%5B"set_id%3AAW45M6AKTt_rMbGdDakT"%5D%7D%2C"op"%3A"IN"%7D%2C%7B"op"%3A"in"%2C"content"%3A%7B"field"%3A"cases.disease_type"%2C"value"%3A%5B"Cystic%2C%20Mucinous%20and%20Serous%20Neoplasms"%5D%7D%7D%2C%7B"op"%3A"in"%2C"content"%3A%7B"field"%3A"files.analysis.workflow_type"%2C"value"%3A%5B"HTSeq%20-%20Counts"%5D%7D%7D%2C%7B"op"%3A"in"%2C"content"%3A%7B"field"%3A"files.data_format"%2C"value"%3A%5B"TXT"%5D%7D%7D%2C%7B"op"%3A"in"%2C"content"%3A%7B"field"%3A"files.experimental_strategy"%2C"value"%3A%5B"RNA-Seq"%5D%7D%7D%5D%7D&searchTableTab=cases)

RAW: [URL for adenocarcinoma, I will choose 30:](https://portal.gdc.cancer.gov/repository?filters=%7B"op"%3A"and"%2C"content"%3A%5B%7B"content"%3A%7B"field"%3A"cases.case_id"%2C"value"%3A%5B"set_id%3AAW45M6AKTt_rMbGdDakT"%5D%7D%2C"op"%3A"IN"%7D%2C%7B"op"%3A"in"%2C"content"%3A%7B"field"%3A"cases.disease_type"%2C"value"%3A%5B"Adenomas%20and%20Adenocarcinomas"%5D%7D%7D%2C%7B"op"%3A"in"%2C"content"%3A%7B"field"%3A"files.analysis.workflow_type"%2C"value"%3A%5B"HTSeq%20-%20Counts"%5D%7D%7D%2C%7B"op"%3A"in"%2C"content"%3A%7B"field"%3A"files.data_format"%2C"value"%3A%5B"TXT"%5D%7D%7D%2C%7B"op"%3A"in"%2C"content"%3A%7B"field"%3A"files.experimental_strategy"%2C"value"%3A%5B"RNA-Seq"%5D%7D%7D%5D%7D&searchTableTab=cases)

### Unit test for data 
The first 15 genes listed in the first column are: 
````
ENSG00000000003.13
ENSG00000000005.5
ENSG00000000419.11
ENSG00000000457.12
ENSG00000000460.15
ENSG00000000938.11
ENSG00000000971.14
ENSG00000001036.12
ENSG00000001084.9
ENSG00000001167.13
ENSG00000001460.16
ENSG00000001461.15
ENSG00000001497.15
ENSG00000001561.6
ENSG00000001617.10
````
The last 15 lines listed in the first column are:
````
ENSGR0000263980.4	
ENSGR0000264510.4	
ENSGR0000264819.4	
ENSGR0000265658.4	
ENSGR0000270726.4	
ENSGR0000275287.3	
ENSGR0000276543.3	
ENSGR0000277120.3	
ENSGR0000280767.1	
ENSGR0000281849.1	
__no_feature	
__ambiguous	
__too_low_aQual	
__not_aligned	
__alignment_not_unique	
````
### Unit test for normalized data
Make a box plots of unnormalized and normalized data
```{r}
par(mfrow=c(1,2))
lcpm <- cpm(x2, log=TRUE)
boxplot(lcpm, las=2, col=col, main="")
title(main="A. Example: Unnormalised data",ylab="Log-cpm")
x2 <- calcNormFactors(x2)  
x2$samples$norm.factors
```

## Proposed Analysis
This project will compute and analyze the logarithmic ratio of differential gene expression of two subtypes of colon cancer. EdgeR will be used to import, organize, and normalize the data,  Mus.musculus will be used for gene annotions, limma will be used to examine the gene expression anaylsis and make exploratory plots, Glimma will be used to make these plots interactive. RColorBrewer and gplots will be used to make heatmaps.

### Flowchart
```{r}
library(DiagrammeR)
grViz("digraph flowchart {
      # node definitions with substituted label text
      node [fontname = Helvetica, shape = rectangle]        
      tab1 [label = '@@1']
      tab2 [label = '@@2']
      tab3 [label = '@@3']
      tab4 [label = '@@4']
      tab5 [label = '@@5']
      tab6 [label = '@@6']
      tab7 [label = '@@7']
      tab8 [label = '@@8']
      tab9 [label = '@@9']
      tab10 [label = '@@10']

      # edge definitions with the node IDs
      tab1 -> tab2;
      tab2 -> tab3;
      tab3 -> tab4;
      tab4 -> tab8 -> tab5;
      tab4 -> tab5 -> tab6 -> tab7 -> tab9 -> tab10
      }

      [1]: 'Download necessary libraries'
      [2]: 'Load and read datasets'
      [3]: 'Join datasets'
      [4]: 'Unit test: Data was properly loaded?'
      [5]: 'Normalize data'
      [6]: 'Find Mean Varience Trend'
      [7]: 'Analyze DE genes'
      [8]: 'Troubleshoot and fix errors'
      [9]: 'Make Interactive MDS plot'
      [10]: 'Make HeatMap of log-CPM data'
      ")

```


### Loading libraries
```{r}
library(limma)
library(Glimma)
library(edgeR)
library(Mus.musculus)
library(gplots)
library(RColorBrewer)
```

## Proposed Timeline and Milestones
Week 1: Run the Glimma vignette. I will install the necessary packages in R and understand each step in the vignette.

Week 2: Load in the data (joins, creating datasets) and do a simple, 1 line unit test to look at the data. I will download 60 datasets (30 from each subtype) and join multiple datasets.

Week 3: Confirm that the data was loaded in correctly and analyze data using the Glimma vignette.

Week 4: Troubleshoot for additional errors and enhance the user interface.

## User Interface
I anticipate having boxplots, heatmaps, and interactive multi-dimensional scaling (MDS) plots done in an R Notebook. I will submit an HTML page of my completed R Notebook.

### MDS plot
```{r}
library(limma)
lcpm <- cpm(x, log=TRUE)
par(mfrow=c(1,2))
col.group <- group
levels(col.group) <-  brewer.pal(nlevels(col.group), "Set1")
col.group <- as.character(col.group)
col.lane <- lane
levels(col.lane) <-  brewer.pal(nlevels(col.lane), "Set2")
col.lane <- as.character(col.lane)
plotMDS(lcpm, labels=group, col=col.group)
title(main="A. Sample groups")
plotMDS(lcpm, labels=lane, col=col.lane, dim=c(3,4))
title(main="B. Sequencing lanes")
```

### Make MDS plot interactive
```{r}
library(glimma)
glMDSPlot(lcpm, labels=paste(group, lane, sep="_"), 
          groups=x$samples[,c(2,5)], launch=FALSE)
```

### HeatMap of log-CPM data
```{r}
library(gplots)
basal.vs.lp.topgenes <- basal.vs.lp$ENTREZID[1:100]
i <- which(v$genes$ENTREZID %in% basal.vs.lp.topgenes)
mycol <- colorpanel(1000,"blue","white","red")
heatmap.2(lcpm[i,], scale="row",
   labRow=v$genes$SYMBOL[i], labCol=group, 
   col=mycol, trace="none", density.info="none", 
   margin=c(8,6), lhei=c(2,10), dendrogram="column")
```


# Ashley E Noriega,
# Nov 13, 2019
# TRGN 510 Final Project: Milestone 1
# A script for setting up the Glimma Vignette
{
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("limma")
library(limma)

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("Glimma") 
library(Glimma)

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
    
BiocManager::install("edgeR")
library(edgeR)

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("Mus.musculus")
library(Mus.musculus)

library(R.utils)

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("CAMERA")
library(CAMERA)
}

# Data Packaging

## Reading in count data
```{r}
library(R.utils)
url <- "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE63310&format=file"
utils::download.file(url, destfile="GSE63310_RAW.tar", mode="wb") 
utils::untar("GSE63310_RAW.tar", exdir = ".")
files <- c("GSM1545535_10_6_5_11.txt", "GSM1545536_9_6_5_11.txt", "GSM1545538_purep53.txt",
  "GSM1545539_JMS8-2.txt", "GSM1545540_JMS8-3.txt", "GSM1545541_JMS8-4.txt",
  "GSM1545542_JMS8-5.txt", "GSM1545544_JMS9-P7c.txt", "GSM1545545_JMS9-P8c.txt")
for(i in paste(files, ".gz", sep=""))
  R.utils::gunzip(i, overwrite=TRUE)
files <- c("GSM1545535_10_6_5_11.txt", "GSM1545536_9_6_5_11.txt",
"GSM1545538_purep53.txt", "GSM1545539_JMS8-2.txt", 
   "GSM1545540_JMS8-3.txt", "GSM1545541_JMS8-4.txt", 
   "GSM1545542_JMS8-5.txt", "GSM1545544_JMS9-P7c.txt", 
   "GSM1545545_JMS9-P8c.txt")
read.delim(files[1], nrow=5)
```

## Reading the 9 text files into R and combining into a matrix of counts
```{r}
library(edgeR)
x <- readDGE(files, columns=c(1,3))
class(x)
dim(x)
```

## Organizing sample information
```{r}
library(edgeR)
samplenames <- substring(colnames(x), 12, nchar(colnames(x)))
samplenames
colnames(x) <- samplenames
group <- as.factor(c("LP", "ML", "Basal", "Basal", "ML", "LP", 
                     "Basal", "ML", "LP"))
x$samples$group <- group
lane <- as.factor(rep(c("L004","L006","L008"), c(3,4,2)))
x$samples$lane <- lane
x$samples
```

## Organizing gene annotations
```{r}
library(Mus.musculus)
geneid <- rownames(x)
genes <- select(Mus.musculus, keys=geneid, columns=c("SYMBOL", "TXCHROM"), 
                keytype="ENTREZID")
head(genes)
```

## Resolve duplicate gene IDs
```{r}
library(Mus.musculus)
genes <- genes[!duplicated(genes$ENTREZID),]
```

## Package in a DGEList-object containing raw count data with associated sample information and gene annotations
```{r}
library(Mus.musculus)
x$genes <- genes
x
```

# Data Pre-processing

## Transformations from the raw-scale: convert raw counts to counts per million (CPM) and log2-counts per million (log-CPM)
```{r}
library(edgeR)
cpm <- cpm(x)
lcpm <- cpm(x, log=TRUE)
L <- mean(x$samples$lib.size) * 1e-6
M <- median(x$samples$lib.size) * 1e-6
c(L, M)
summary(lcpm)
```

## Remove lowly expressed genes
```{r}
library(edgeR)
table(rowSums(x$counts==0)==9)
```

## Filter genes while keeping as many genes as possible with worthwile counts
```{r}
library(edgeR)
keep.exprs <- filterByExpr(x, group=group)
x <- x[keep.exprs,, keep.lib.sizes=FALSE]
dim(x)
```

## Plot the density of log-CPM values for raw and filtered data
```{r}
lcpm.cutoff <- log2(10/M + 2/L)
library(RColorBrewer)
nsamples <- ncol(x)
col <- brewer.pal(nsamples, "Paired")
par(mfrow=c(1,2))
plot(density(lcpm[,1]), col=col[1], lwd=2, ylim=c(0,0.26), las=2, main="", xlab="")
title(main="A. Raw data", xlab="Log-cpm")
abline(v=lcpm.cutoff, lty=3)
for (i in 2:nsamples){
den <- density(lcpm[,i])
lines(den$x, den$y, col=col[i], lwd=2)
}
legend("topright", samplenames, text.col=col, bty="n")
lcpm <- cpm(x, log=TRUE)
plot(density(lcpm[,1]), col=col[1], lwd=2, ylim=c(0,0.26), las=2, main="", xlab="")
title(main="B. Filtered data", xlab="Log-cpm")
abline(v=lcpm.cutoff, lty=3)
for (i in 2:nsamples){
den <- density(lcpm[,i])
lines(den$x, den$y, col=col[i], lwd=2)
}
legend("topright", samplenames, text.col=col, bty="n")
```

## Normalising gene expression distributions
```{r}
library(edgeR)
x <- calcNormFactors(x, method = "TMM")
x$samples$norm.factors
```

## Improve visualization by duplicating data then adjusting the counts
```{r}
x2 <- x
x2$samples$norm.factors <- 1
x2$counts[,1] <- ceiling(x2$counts[,1]*0.05)
x2$counts[,2] <- x2$counts[,2]*5
```

## Boxplot expression distribution of samples for normalised and unnormalised data
```{r}
par(mfrow=c(1,2))
lcpm <- cpm(x2, log=TRUE)
boxplot(lcpm, las=2, col=col, main="")
title(main="A. Example: Unnormalised data",ylab="Log-cpm")
x2 <- calcNormFactors(x2)  
x2$samples$norm.factors
lcpm <- cpm(x2, log=TRUE)
boxplot(lcpm, las=2, col=col, main="")
title(main="B. Example: Normalised data",ylab="Log-cpm")
```

## Unsupervised clustering of cells: make multi-dimensional scaling plot (MDS) to show simmilarities and dissimilarities between samples in an unsupervised manner
```{r}
library(limma)
lcpm <- cpm(x, log=TRUE)
par(mfrow=c(1,2))
col.group <- group
levels(col.group) <-  brewer.pal(nlevels(col.group), "Set1")
col.group <- as.character(col.group)
col.lane <- lane
levels(col.lane) <-  brewer.pal(nlevels(col.lane), "Set2")
col.lane <- as.character(col.lane)
plotMDS(lcpm, labels=group, col=col.group)
title(main="A. Sample groups")
plotMDS(lcpm, labels=lane, col=col.lane, dim=c(3,4))
title(main="B. Sequencing lanes")
```

## Make interactive using Glimma, html page will be generarted and opened in a browser if launch=TRUE
```{r}
library(Glimma)
glMDSPlot(lcpm, labels=paste(group, lane, sep="_"), 
          groups=x$samples[,c(2,5)], launch=FALSE)

```

# Differential Expression Analysis

## Creating a design matrix
```{r}
design <- model.matrix(~0+group+lane)
colnames(design) <- gsub("group", "", colnames(design))
design
```

## Contrasts for pairwise comparisons between cell populations
```{r}
library(Glimma)
contr.matrix <- makeContrasts(
   BasalvsLP = Basal-LP, 
   BasalvsML = Basal - ML, 
   LPvsML = LP - ML, 
   levels = colnames(design))
contr.matrix
```

## Remove heteroscedascity from count data
```{r}
library(Glimma)
par(mfrow=c(1,2))
v <- voom(x, design, plot=TRUE)
v
```

## Apply voom precision weights to data
```{r}
vfit <- lmFit(v, design)
vfit <- contrasts.fit(vfit, contrasts=contr.matrix)
efit <- eBayes(vfit)
plotSA(efit, main="Final model: Mean-variance trend")
```


## Examine the number of DE genes
```{r}
summary(decideTests(efit))
```

## Set a minimum log-fold change(log-FC) of 1
```{r}
tfit <- treat(vfit, lfc=1)
dt <- decideTests(tfit)
summary(dt)
```

## Extract genes that are DE in multiple comparisons
```{r}
de.common <- which(dt[,1]!=0 & dt[,2]!=0)
length(de.common)
head(tfit$genes$SYMBOL[de.common], n=20)
vennDiagram(dt[,1:2], circle.col=c("turquoise", "salmon"))
```

## Extract and write results for all 3 comparisons (basalvsLP, basalvsML, and LPvsML) to a single output file
```{r}
write.fit(tfit, dt, file="results.txt")
```

## Examining individual DE genes from top to bottom
```{r}
basal.vs.lp <- topTreat(tfit, coef=1, n=Inf)
basal.vs.ml <- topTreat(tfit, coef=2, n=Inf)
head(basal.vs.lp)
```

```{r}
head(basal.vs.ml)
```

## Summarize results for genes using mean-difference plots that highlight differentially expressed genes
```{r}
plotMD(tfit, column=1, status=dt[,1], main=colnames(tfit)[1], 
       xlim=c(-8,13))
```

## Make interactive mean-difference plot. To open html page in a browser make launch=TRUE.
```{r}
glMDPlot(tfit, coef=1, status=dt, main=colnames(tfit)[1],         side.main="ENTREZID", counts=lcpm, groups=group, launch=TRUE)
```

## Make heatmap
```{r}
library(gplots)
basal.vs.lp.topgenes <- basal.vs.lp$ENTREZID[1:100]
i <- which(v$genes$ENTREZID %in% basal.vs.lp.topgenes)
mycol <- colorpanel(1000,"blue","white","red")
heatmap.2(lcpm[i,], scale="row",
   labRow=v$genes$SYMBOL[i], labCol=group, 
   col=mycol, trace="none", density.info="none", 
   margin=c(8,6), lhei=c(2,10), dendrogram="column")
```

# Gene set testing by applying the camera method on c2 gene signatures from the Broad Institute’s MSigDB c2 collection
```{r}
load("/Users/ashleynoriega/Downloads/mouse_c2_v5p2.rdata")
idx <- ids2indices(Mm.c2,id=rownames(v))
cam.BasalvsLP <- camera(v,idx,design,contrast=contr.matrix[,1])
head(cam.BasalvsLP,5)
cam.BasalvsML <- camera(v,idx,design,contrast=contr.matrix[,2])
head(cam.BasalvsML,5)
cam.LPvsML <- camera(v,idx,design,contrast=contr.matrix[,3])
head(cam.LPvsML,5)
```


```{r}
barcodeplot(efit$t[,3], index=idx$LIM_MAMMARY_LUMINAL_MATURE_UP, 
            index2=idx$LIM_MAMMARY_LUMINAL_MATURE_DN, main="LPvsML")
```

